Andre Nguyen CMSC320 Section 0101 Final Project: Analyzing Gun Violence in the USA from 2013 - 2018

Unfortunately, gun violence has been a prevalent issue in the US and to this day continues to wreak havoc throughout the country. The data set used here is from https://github.com/jamesqo/gun-violence-data. It was collected by web scraping the Gun Violence Archive website, where hundreds of thousands of cases have been posted over the years. The data was converted into a large csv file that contains all recorded cases from 1/1/2013 to 3/31/2018. From this data, we can learn which states had the most gun incidents and deaths due to gun violence. We can also learn which areas are commonly impacted by gun violence, and we can analyze the age groups of the people involved. This data set is relatively clean, but for the purposes of this analysis, more cleaning is required.

Notes:

In the first cell, I import all of the Python packages I intend to use throughout this tutorial. I imported pandas and numpy in order to work with data sets as dataframes. I imported matplotlib and seaborn in order to create different graphs and plots. Finally, I imported sklearn packages in order to create models that look for correlations between different variables and predict desired values.

This csv file was compiled from data provided by the Gun Violence Archive website. We can use pandas to read this file into a dataframe to easily manipulate the data. After downloading the file from the GitHub repository, make sure you unzip it before reading it.

To clean this data set, we will remove all columns that are irrelevant to our analysis. We will also remove all rows with any NaN values. There are better ways to handle missing data, but in this case it would be very difficult to impute non-quantitative values. Thus, it is simplest to remove all rows with missing values.
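The cleaning step described above could look roughly like the following. This is only a sketch on a tiny stand-in dataframe: the column names and the choice of which columns count as irrelevant are assumptions made for illustration, not the exact ones used in this analysis.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real csv; the column names here are assumptions.
raw = pd.DataFrame({
    "incident_id": [1, 2, 3, 4],
    "state": ["Texas", "Ohio", None, "Maine"],
    "n_killed": [1, 0, 2, 0],
    "participant_age": ["0::25", np.nan, "0::31", "0::19||1::22"],
    "source_url": ["a", "b", "c", "d"],  # treated as irrelevant below
})

# Hypothetical list of columns we decide not to analyze.
irrelevant_columns = ["incident_id", "source_url"]

cleaned = (
    raw.drop(columns=irrelevant_columns)  # remove unneeded columns
       .dropna()                          # drop rows with any NaN/None value
       .reset_index(drop=True)            # renumber the remaining rows
)

print(cleaned)
```

Dropping columns before calling dropna matters: a row is only discarded for missing values in columns we actually keep.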

The data within the cells of the following columns:

are in a strange format. Each cell is supposed to represent a list of dictionaries: :: separates the key from the value in each pair, and || separates the elements of the list. To deal with this, I iterate through each item within the respective column and split the item by ||. Now that I have a list of key-value pairs, I iterate through each element and parse it into a dictionary. Then I create a new column with these dictionaries. I do this for all of the mentioned columns.

During this parsing process, I encountered an unexpected case. Some of the cells have incorrect formatting, so I had to adapt my approach to this.
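As a rough sketch of the parsing described above, the helper below splits a cell into key-value pairs and builds one dictionary per cell. The function name is hypothetical. The text does not say what the incorrectly formatted cells look like, so the fallback here assumes they use a single | and a single : in place of || and ::; adjust it to whatever your data actually contains.

```python
import re


def parse_field(cell):
    """Turn '0::Adult 18+||1::Teen 12-17' into {'0': 'Adult 18+', '1': 'Teen 12-17'}."""
    parsed = {}
    # Split on '||', or on a single '|' (the assumed malformed variant).
    for item in re.split(r"\|{1,2}", cell):
        if not item:
            continue  # skip empty pieces such as a trailing separator
        # Split on the first '::', or on a single ':' (assumed malformed variant).
        parts = re.split(r":{1,2}", item, maxsplit=1)
        if len(parts) != 2:
            continue  # no key-value separator found; ignore this piece
        key, value = parts
        parsed[key.strip()] = value.strip()
    return parsed


print(parse_field("0::Adult 18+||1::Teen 12-17"))
print(parse_field("0:Male|1:Female"))
```

With pandas, a helper like this can be applied to a whole column with `df[column].apply(parse_field)` and the result stored in a new column.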

Now we drop these columns as well, since we have replaced them with ones with improved formatting.

This is our resulting dataframe with all of the data regarding gun incidents from 1/1/13 to 3/31/18. Reminder that in 2013, not all incidents were recorded. After cleaning and parsing, the analysis can begin.

Now, we will start learning from our current dataset. It is important to note that many of the original rows were removed during cleaning, even if they had some entries with meaningful values. I want to find the total number of gun incidents that occurred in each state from 2013 to 2018. To do this, I create a dictionary in which each key is a state and each value is the number of times that state appears in the data set. I then iterate through the rows of the dataframe: if the state hasn't been encountered before, it is initialized in the dictionary with 1 gun incident. Otherwise, the state's number of incidents is incremented by 1. After this loop, I create a bar graph, because it makes the states easiest to compare. A pie chart may be a viable alternative, but there are so many states to compare that some slices would be too small to read, e.g. the District of Columbia.
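The counting loop described above can be sketched as follows. The list of states is a small stand-in for the state column of the cleaned dataframe, not real data.

```python
# Stand-in for the state column of the cleaned dataframe.
states = ["Texas", "Ohio", "Texas", "Maine", "Ohio", "Texas"]

incidents_per_state = {}
for state in states:
    if state not in incidents_per_state:
        incidents_per_state[state] = 1   # first incident seen for this state
    else:
        incidents_per_state[state] += 1  # one more incident for a known state

# Sort from most to fewest incidents so the bar graph is easy to read.
ranked = sorted(incidents_per_state.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

If the data is already in a pandas dataframe, calling `value_counts()` on the state column produces the same tally, already sorted, without an explicit loop.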

From this visual, we can learn plenty. Texas, California, and Florida are the top three states in gun incidents, while the District of Columbia, Hawaii, and Wyoming are the bottom three. We can hypothesize that the number of gun incidents is correlated with the population of the state, which makes sense because the more people there are, the more gun violence can occur. The result is also consistent with Texas and Florida being Republican-leaning states, since Republicans tend to support gun ownership. More research must be done to understand why gun violence is so prevalent in California, a Democratic-leaning state.

The same process is repeated to see whether the number of deaths from gun violence correlates with the number of gun incidents in a state. For the most part, there is a correlation, but there are some special cases, such as Louisiana, Massachusetts, and some others. These states had a number of gun incidents, but comparatively fewer deaths.
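One way to put a number on this comparison is to aggregate incidents and deaths per state and compute their correlation. The sketch below uses made-up stand-in rows, and the column names state and n_killed are assumptions about the dataframe.

```python
import pandas as pd

# Stand-in data; 'state' and 'n_killed' are assumed column names.
df = pd.DataFrame({
    "state": ["Texas", "Texas", "Ohio", "Ohio", "Ohio", "Maine"],
    "n_killed": [1, 2, 0, 1, 0, 0],
})

per_state = df.groupby("state").agg(
    incidents=("n_killed", "size"),  # number of rows (incidents) per state
    deaths=("n_killed", "sum"),      # total people killed per state
)

# Deaths per incident highlights states with many incidents but few deaths.
per_state["deaths_per_incident"] = per_state["deaths"] / per_state["incidents"]

# Pearson correlation between the two state-level totals.
correlation = per_state["incidents"].corr(per_state["deaths"])

print(per_state)
print(correlation)
```

States whose deaths per incident fall well below the rest are the special cases worth a closer look.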

From these two visualizations, we can come to the conclusion that states with larger populations tend to have more gun incidents.

In this next analysis, we want to find out the general age/gender of suspects and victims. This may provide insight on the type of people that are involved in these gun incidents.

To do this, I created dictionaries to store the ages of suspects and victims. The keys would be the age and the value would be how many suspects/victims were at this age. I also did the same with the gender of suspects and victims. The key is the gender and the value would be the number of suspects/victims with this gender. It is important to note that the dataset only accepts male or female as a valid gender.

Now, we iterate through the rows of the dataframe. For each row, we go through the dictionary that tells us which participants are suspects and iterate through each key. If the key corresponds to Subject-Suspect, it is stored in a temporary list. Still in the same row, we iterate through each key in the dictionary that tells us the age of each participant. If the current key belongs to a Subject-Suspect, the respective keys and/or values are updated accordingly in the suspect dictionaries. Otherwise, the same is done for the victim dictionaries.
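A minimal sketch of this tallying loop is shown below. The layout of the per-incident dictionaries and the example values are assumptions for illustration, and `collections.Counter` stands in for the hand-built dictionaries described above.

```python
from collections import Counter

# One entry per incident; keys are participant indices stored as strings.
# The dictionary layout and labels below are assumptions for illustration.
incidents = [
    {
        "type": {"0": "Victim", "1": "Subject-Suspect"},
        "age": {"0": "17", "1": "24"},
        "gender": {"0": "Female", "1": "Male"},
    },
    {
        "type": {"0": "Subject-Suspect", "1": "Victim", "2": "Victim"},
        "age": {"0": "24", "1": "30"},  # age of participant 2 is missing
        "gender": {"0": "Male", "1": "Male", "2": "Female"},
    },
]

suspect_ages, victim_ages = Counter(), Counter()
suspect_genders, victim_genders = Counter(), Counter()

for incident in incidents:
    # Temporary list of the participants labeled as suspects in this incident.
    suspects = [k for k, v in incident["type"].items() if v == "Subject-Suspect"]

    for key, age in incident["age"].items():
        target = suspect_ages if key in suspects else victim_ages
        target[int(age)] += 1

    for key, gender in incident["gender"].items():
        target = suspect_genders if key in suspects else victim_genders
        target[gender] += 1

print(suspect_ages, victim_ages)
print(suspect_genders, victim_genders)
```

Because ages and genders are looked up by participant index, a participant with a missing age is still counted in the gender tally.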

Now we plot the information that we gathered into more bar graphs. From the bar graph above, we can see the varying ages of suspects involved in gun incidents. Most suspects are between 20 and 60 years old, and the majority of them are around 20 years old. There are some frightening outlier cases in which the suspect is younger than 18, even as young as 0 years old. At the other extreme, there are some outlier cases in which suspects are as old as 80. These cases should be looked into individually and classified accordingly.

The age distribution of the victims seems to correlate strongly with that of the suspects. However, we can still draw new conclusions from this visualization. There are significantly more victims under the age of 18 than there are suspects under the age of 18. One possible explanation is mass school shootings, although this data alone does not confirm it. Other than this, the number of victims at a certain age correlates with the number of suspects at the respective age.

The age of the victim does not directly correspond to the age of the suspect. In other words, the one 0 year old did not kill 25 0 year olds. These two graphs are simply showing us the age of the suspects and age of the victims throughout all of the data set.

From this bar graph, it appears to be an 8:1 ratio of male to female suspects in gun incidents.

However, there is approximately a 4:3 ratio of male to female victims. From these two bar graphs, we can see that females are far less likely than males to be suspects, and that they make up a much larger share of victims than of suspects in gun incidents.

Now, we are going to use folium in order to visualize where these gun incidents are occurring in the US. First, I initialize the map to look at the United States and adjust the zoom in order to see Alaska and Hawaii as well. Then I iterate through every row of the dataframe and mark the map based on the latitude and longitude of the incident.

From a quick glance, we can easily discern that gun incidents tend to occur from the mid-east to the East Coast and on the West Coast of the US. This is most likely due to the fact that there are not many people in the middle of the US. According to the census (https://www.census.gov/popclock/data_tables.php?component=growth), the majority of people live in the West and South. This does not explain why the Northeast has so many gun incidents; more research must be done. If we zoom into these regions, we can see that these incidents tend to occur in the major cities of each state. A likely explanation is that population density is high in these cities, or that many people living there are in poverty.

For the next part of the analysis, we will look at how gun types correlate with the number of kills and injuries in each incident. We will create lists to store the number of kills and injuries for each incident, the main gun type involved in the incident, and the corresponding main gun codes. We determine the main gun type by the most "dangerous" gun involved in the incident, e.g. if two guns are involved, a handgun and a rifle, the rifle will be the main gun type.

As usual, we will iterate through all rows of the dataframe and access the current gun_type_dict. For each key in the dictionary, append the gun's classification to a temporary list. Afterwards, we will append the main gun type and code, depending on the most "dangerous" gun.
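Picking the main gun for one incident could look like the sketch below. The text only states that a rifle outranks a handgun, so the rest of the ranking, the numeric codes, the keyword matching, and the function names are all assumptions made for illustration.

```python
# Hypothetical danger ranking: a higher code means "more dangerous".
# Only "rifle outranks handgun" comes from the text; the rest is assumed.
GUN_CODES = {"Other": 0, "Handgun": 1, "Shotgun": 2, "Rifle": 3}


def classify(gun_name):
    """Map a raw gun description to one of the coarse classes above."""
    name = gun_name.lower()
    if "rifle" in name:
        return "Rifle"
    if "shotgun" in name or "gauge" in name:
        return "Shotgun"
    if "handgun" in name:
        return "Handgun"
    return "Other"  # includes unknown guns


def main_gun(gun_type_dict):
    """Return (class, code) of the most dangerous gun in one incident."""
    classes = [classify(name) for name in gun_type_dict.values()]
    if not classes:
        return "Other", GUN_CODES["Other"]
    best = max(classes, key=GUN_CODES.get)  # class with the highest code
    return best, GUN_CODES[best]


print(main_gun({"0": "Handgun", "1": "Rifle"}))
```

Keeping the ranking in a single dictionary makes it easy to change the ordering later without touching the loop over the dataframe.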

Once we finish looping through the dataframe, we will create a Linear Regression model. We will be using the main gun code as the independent variable and the number of kills/injuries as the dependent variable. After we fit the model and use it to predict values, we will plot our points onto a scatter plot in order to see the line of best fit.
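A minimal version of this regression with scikit-learn is sketched below on made-up stand-in values. The variable names are hypothetical, and the integer gun codes are the assumed encoding of the main gun type.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: one main gun code and one kill count per incident.
main_gun_codes = np.array([0, 1, 1, 2, 3, 3, 0, 1])
n_killed = np.array([1, 0, 1, 2, 1, 3, 0, 1])

X = main_gun_codes.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
model = LinearRegression().fit(X, n_killed)

predicted = model.predict(X)          # points on the line of best fit
r_squared = model.score(X, n_killed)  # share of variance explained

print(model.coef_[0], model.intercept_, r_squared)
```

Plotting the incidents with `plt.scatter` and the predicted values with `plt.plot` on the same axes shows the line of best fit. Note that encoding gun type as an integer imposes an ordering on the categories, which is one reason a linear fit may struggle here.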

For both of these scatter plots, there appears to be little to no correlation between the main gun involved in the incident and the number of kills/injuries. It appears to be the same with the number of guns as well. As of now, we can only say that this is because of outliers and multiple outside factors that may or may not be quantifiable.

In order to understand more about why there is no correlation, we must do some more visualizations. Below, I am creating a violin plot in order to easily see the distribution of the number of kills/injuries. The previous scatter plot clearly shows the range of kills/injuries. However, it doesn't tell us how often a given number of people were killed for each gun type. The violin plot solves this problem, but makes it harder to see the less frequent values due to the number of outliers within this data set.

From these new visualizations, we can understand a little more about why there doesn't seem to be much correlation. The violin plot shows that no matter the gun, most gun incidents have at least 1 killed/injured participant. The bar graph above shows that many gun incidents are classified under the "Other" category, which means that the gun was unknown. With less reliable values, it is harder to correlate gun type with our desired dependent variable.

Despite the lack of correlation, we will make another model to predict the number of people killed. Earlier, we created a Linear Regression model to fit the data; now we will use a Decision Tree. We split our data into training and test data. The model is fit on the training data, and its predictions for the test data are then compared with the actual test values in order to see how accurate our Decision Tree is. After fitting, we also find the cross validation score as a second measure of how accurate our model is. We will do this using the main gun code and the number of guns involved as our independent variables.
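The train/test split and cross validation described above can be sketched as follows. The data here is randomly generated, so the scores say nothing about the real data set, and the use of a classifier rather than a regressor is an assumption based on the mention of an accuracy score.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)         # fixed seed for repeatable stand-in data
n = 200
main_gun_code = rng.integers(0, 4, size=n)
n_guns = rng.integers(1, 4, size=n)
n_killed = rng.integers(0, 3, size=n)  # random target, so there is no real signal

X = np.column_stack([main_gun_code, n_guns])  # two independent variables
X_train, X_test, y_train, y_test = train_test_split(
    X, n_killed, test_size=0.25, random_state=0
)

# Fit on the training data only, then score on the held-out test data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)

# 5-fold cross validation on the full data as a second accuracy estimate.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, n_killed, cv=5)

print(test_accuracy, cv_scores.mean())
```

Comparing the held-out accuracy with the mean cross validation score is a quick check that a single lucky split is not inflating the result.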

The Decision Tree had an accuracy score of approximately 61% in each case. This could be more accurate, but the data provided is not enough to reliably predict how many participants would be killed/injured. There are too many outside factors that are not quantifiable or are difficult to collect.

From this tutorial, we have learned a lot about gun violence. We learned the demographics of the suspects, who tend to be males around the ages of 20 to 30. The victims also tend to be males around the ages of 20 to 30. It is important to note that there are far more female victims than female suspects, and the same is true for those under the age of 18. We can see that gun violence is an issue in cities, where the population density tends to be high. We also learned that, in this data set, there is little to no correlation between the number or types of guns in an incident and the number of participants killed/injured. This is likely due to outside factors that are not within our data set because they are difficult to quantify. There are too many variables to account for when reporting crime.

In the data science lifecycle, we collect, process, analyze, then visualize the data. Then we form a hypothesis and attempt to create a model to support it. The final step is to make decisions on what to do next. From the information we have gathered, we can educate policymakers on the state of the country with regard to gun violence. To improve our hypothesis and machine learning models, we can collect more data in order to have a better gauge of the situation. We will then have a better idea of how to approach this problem and be able to save lives.